SearchEngine - Frequently asked questions
- This chapter lists common problems encountered when using the
SearchEngine, together with their solutions. The questions are
divided into two categories: the SearchEngine and the Search applet.
The FAQ index
-
- Files are not being excluded
- The SearchEngine is reading files excluded with the -xu
flag.
- SearchEngine: tags or tag attributes are being
stored in the database
- The SearchEngine is storing words which look suspiciously like
tags or tag attributes.
- SearchEngine: keywords in titles and headers
are missing
- The SearchEngine is not storing words which appear in HTML
tags like <TITLE>, <H1..H6>,
etc.
- SearchEngine: runs fine for a while, then
slows down
- The SearchEngine parses the first few hundred files, then slows
down and starts thrashing (repeatedly using) the hard-disk.
- SearchEngine: stops with an OutOfMemoryException
- The SearchEngine parses the first few hundred files, then displays
a long list of error messages, starting with OutOfMemoryException.
- SearchEngine: stops with a 'Too many files for
the search applet database' message
- The SearchEngine parses many hundreds of files, then displays a
'Too many files for the search applet database' message.
- Applet: Search button remains gray, or an
error message appears
- The applet starts up, but after a few seconds, the search button
appears grayed out, or an error message is displayed.
- Applet: Clicking on a title causes the browser
to issue a 'document not found' error
- When the user double-clicks on a found document title, the browser
issues a 'document not found' error message instead of opening the
document.
Questions about the SearchEngine
-
- Files are not being
excluded
- The SearchEngine is reading files excluded with the -xu
flag.
-
- Take care when using the wildcard character '*'.
- The wildcard character '*' is only recognized at the
start of the URL, at the end of the URL, or both;
anywhere else it is treated as an ordinary character.
No other uses of the wildcard character '*'
are valid. A filter definition of */extawt/*remove.*
results in a (probably useless) filter that ignores all URLs
containing the literal text /extawt/*remove., rather than the
probable intention of ignoring all URLs that contain /extawt/
and also remove.
- The SearchEngine uses case-sensitive URLs when
filtering.
- Some operating systems (Windows) treat file names as case
insensitive, but the SearchEngine does not. If, for example, the
filter
-xu *.zip
is used, then all files ending in .zip are
excluded, but files ending in .ZIP are not.
Use both lower case and upper case to filter file
extensions:
-xu *.zip
-xu *.ZIP
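The two filtering rules above (wildcard only at the start and/or end, and case-sensitive comparison) can be illustrated with a small Java sketch. This is not the SearchEngine's own code: the class UrlFilterSketch and its matches method are hypothetical, and the exact-match behaviour for a pattern without any wildcard is an assumption.

```java
// Hypothetical sketch of the -xu wildcard rules described above.
// UrlFilterSketch is not part of ruptools; it only mirrors the documented behaviour.
class UrlFilterSketch {

    /** Returns true if the URL would be excluded by the given filter pattern. */
    static boolean matches(String pattern, String url) {
        // '*' acts as a wildcard only as the first and/or last character.
        boolean leading = pattern.startsWith("*");
        boolean trailing = pattern.length() > 1 && pattern.endsWith("*");
        // Everything in between is literal text, including any inner '*'.
        String core = pattern.substring(leading ? 1 : 0,
                                        pattern.length() - (trailing ? 1 : 0));
        if (leading && trailing) {
            return url.contains(core);   // *text*
        }
        if (leading) {
            return url.endsWith(core);   // *text
        }
        if (trailing) {
            return url.startsWith(core); // text*
        }
        return url.equals(core);         // no wildcard: exact match (assumption)
    }

    public static void main(String[] args) {
        // Comparison is case sensitive, so each spelling needs its own filter.
        System.out.println(matches("*.zip", "docs/archive.zip"));
        System.out.println(matches("*.zip", "docs/ARCHIVE.ZIP"));
        System.out.println(matches("*.ZIP", "docs/ARCHIVE.ZIP"));
        // The inner '*' is literal, so an ordinary URL does not match.
        System.out.println(matches("*/extawt/*remove.*", "a/extawt/xremove.html"));
    }
}
```

Under these rules, a pattern such as */extawt/*remove.* only matches URLs that contain an actual '*' character between /extawt/ and remove., which is why two separate filters are usually what is wanted.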
- Tags or tag attributes
are being stored in the database
- The SearchEngine is storing words which look suspiciously like
tags or tag attributes.
-
- The HTML documents may indeed contain the tag
keywords as text, for example if the document is about HTML.
- Check the documents for the offending keywords and determine
whether they are inside HTML markup or in the body text. Watch
out for incorrectly formed comment syntax.
- The HTML document may have syntax errors,
which caused the SearchEngine to store the words in the body, or
ignore them completely.
- Check the documents for the offending keywords, and ensure
that they are inside the correct HTML markup. Watch
out for incorrectly formed comment syntax.
- Keywords in titles and
headers are missing
- The SearchEngine is not storing words which appear in HTML
tags like <TITLE>, <H1..H6>,
etc.
-
- The HTML document may have syntax errors,
which caused the SearchEngine to store the words in the body, or
ignore them completely.
- Check the documents for the offending keywords, and ensure
that they are inside the correct HTML markup. Watch
out for incorrectly formed comment syntax.
- Runs fine for a while,
then slows down
- The SearchEngine parses the first few hundred files, then slows
down and starts thrashing (repeatedly using) the hard-disk.
-
- The SearchEngine is running out of virtual memory.
- The SearchEngine requires virtual memory of about 1.5 to 2.0
times the size of the documents being parsed. If, say, you
have 9 MB of documents, then you will require about 13.5 to 18 MB
of virtual memory.
Start the Java interpreter with as much virtual memory as
needed using the -mx switch (the default is 16 MB):
java -mx24m ruptools.SearchEngine ...
- Not enough virtual memory.
- Possible solutions are:
- Split the files up into sub-groups, and create databases
for each.
- Remove word groups, -nb, -nl, -nh (in that order).
- Do both: a restricted global search, with complete
sub-searches.
- Increase the word exclusion list (english.exclude.html is
very generic).
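The sizing rule quoted above can be worked through in a short Java sketch. It is only an illustration: the class and method names are hypothetical, and the 1.5 and 2.0 factors are the rough figures given in the text, not values read from the SearchEngine. For 9 MB of documents the factors give 13.5 and 18 MB, which the sketch rounds up to whole megabytes.

```java
// Hypothetical helper that turns the "1.5 to 2.0 times the document size"
// rule of thumb into a suggested -mx value. Not part of ruptools.
class HeapEstimateSketch {
    static final double LOW_FACTOR = 1.5;   // lower bound quoted in the text
    static final double HIGH_FACTOR = 2.0;  // upper bound quoted in the text

    /** Returns {low, high} estimates in whole megabytes, rounded up. */
    static long[] estimateMegabytes(long documentBytes) {
        double documentMb = documentBytes / (1024.0 * 1024.0);
        return new long[] {
            (long) Math.ceil(documentMb * LOW_FACTOR),
            (long) Math.ceil(documentMb * HIGH_FACTOR)
        };
    }

    /** Builds an -mx switch from the upper estimate. */
    static String mxFlag(long documentBytes) {
        return "-mx" + estimateMegabytes(documentBytes)[1] + "m";
    }

    public static void main(String[] args) {
        long nineMegabytes = 9L * 1024 * 1024;
        long[] range = estimateMegabytes(nineMegabytes);
        System.out.println("Estimated need: " + range[0] + " to " + range[1] + " MB");
        System.out.println("java " + mxFlag(nineMegabytes) + " ruptools.SearchEngine ...");
    }
}
```

Since both figures sit close to or above the 16 MB default mentioned above, a document set of this size is already a candidate for a larger -mx value.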
- Stops with an OutOfMemoryException
- The SearchEngine parses the first few hundred files, then displays
a long list of error messages, starting with OutOfMemoryException.
-
- The SearchEngine ran out of virtual memory.
- The SearchEngine requires virtual memory of about 1.5 to 2.0
times the size of the documents being parsed. If, say, you
have 9 MB of documents, then you will require about 13.5 to 18 MB
of virtual memory.
Start the Java interpreter with as much virtual memory as
needed using the -mx switch (the default is 16 MB):
java -mx24m ruptools.SearchEngine
- Not enough virtual memory.
- Possible solutions are:
- Split the files up into sub-groups, and create databases
for each.
- Remove word groups, -nb, -nl, -nh (in that order).
- Do both: a restricted global search, with complete
sub-searches.
- Increase the word exclusion list (english.exclude.html is
very generic).
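One way to plan the first solution, splitting the files into sub-groups, is to pack the files into groups whose combined size stays under a chosen budget. The Java sketch below assumes a simple greedy, in-order packing; the class name, the method, and the budget are all hypothetical, and how the groups are then fed to the SearchEngine is left open.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical planning helper: groups file sizes so that each group's
// total stays within a budget. Not part of ruptools.
class GroupSplitSketch {

    /** Greedily packs sizes, in order, into groups of at most maxGroupBytes. */
    static List<List<Long>> split(List<Long> fileSizes, long maxGroupBytes) {
        List<List<Long>> groups = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        long total = 0;
        for (long size : fileSizes) {
            // Start a new group when adding this file would exceed the budget.
            if (!current.isEmpty() && total + size > maxGroupBytes) {
                groups.add(current);
                current = new ArrayList<>();
                total = 0;
            }
            // A single file larger than the budget still gets a group of its own.
            current.add(size);
            total += size;
        }
        if (!current.isEmpty()) {
            groups.add(current);
        }
        return groups;
    }

    public static void main(String[] args) {
        List<Long> sizesInMb = Arrays.asList(4L, 4L, 4L, 9L, 1L);
        System.out.println(split(sizesInMb, 8));
    }
}
```

Each resulting group would then be compiled into its own database, with the budget chosen from the memory available to the Java interpreter.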
- Stops with a 'Too many
files for the search applet database' message
- The SearchEngine parses many hundreds of files, then displays a
'Too many files for the search applet database' message.
-
- The SearchEngine exceeded the maximum number of files that the
applet database can hold.
- The applet database can hold information on a maximum of
4096 HTML documents.
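To see in advance whether a document tree is likely to hit this limit, the files can be counted before running the SearchEngine. The Java sketch below is a hypothetical pre-check, not part of ruptools; it assumes that only files ending in .htm or .html count as HTML documents, and it does not take any -xu filters into account.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Locale;
import java.util.stream.Stream;

// Hypothetical pre-check against the 4096-document limit stated above.
class DocumentCountSketch {
    static final int MAX_DOCUMENTS = 4096; // limit quoted in the text

    /** Assumption: HTML documents are files ending in .htm or .html. */
    static boolean isHtml(Path file) {
        String name = file.getFileName().toString().toLowerCase(Locale.ROOT);
        return name.endsWith(".htm") || name.endsWith(".html");
    }

    /** Counts HTML files below the given root directory. */
    static long countHtml(Path root) throws IOException {
        try (Stream<Path> paths = Files.walk(root)) {
            return paths.filter(Files::isRegularFile)
                        .filter(DocumentCountSketch::isHtml)
                        .count();
        }
    }

    static boolean fitsInAppletDatabase(long documentCount) {
        return documentCount <= MAX_DOCUMENTS;
    }

    public static void main(String[] args) throws IOException {
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        long count = countHtml(root);
        System.out.println(count + " HTML documents; fits in one database: "
                + fitsInAppletDatabase(count));
    }
}
```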
Questions about the Search applet
-
- Search button remains
gray, or an error message appears
- The applet starts up, but after a few seconds, the search button
appears grayed out, or an error message is displayed.
- The cause of this problem is that the applet failed to find or
load the database.
-
- Check that the file path is correct.
- The applet looks in the path made up of the codebase
plus the database parameter value. Supposing the applet
definition is:
<applet codebase="../classes" archive="Search.zip"
code="ruptools.Search.class" width=100 height=20>
<param name=database value="docsearch">
</applet>
and assuming the applet HTML file is in the /search
directory, then the applet will look for the file in /search/../classes/docsearch.ws
or, when reduced, /classes/docsearch.ws
If this is not the correct location of the database file, then
either copy the database to that location, or change the database
parameter value.
Remember that the database file must be located within the codebase
path of the applet, otherwise some browsers may refuse access to
the file, causing the applet to fail.
- Check the file path for spelling.
- On some operating systems (Windows) file names are case
insensitive, whilst on others (Unix) they are not. Ensure that the codebase
path and database parameter path use the same case
as the directories and filename. The database file extension is .ws,
in lower case.
- Check that the file path is within the codebase
path.
- As when checking the file path, ensure that the reduced file
path is the same as, or a child directory of, the codebase,
otherwise some browsers may refuse access to the file, causing
the applet to fail.
- Check the database file.
- The database file may have become corrupt, or have been
replaced. Recompile the database, and copy the file, then try
running the applet again in the browser or appletviewer.
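The path checks above can be reproduced outside the browser with java.net.URI, which resolves and reduces relative paths in the same general way. The sketch below is hypothetical: the class and method names, the host www.example.com, and the codebase value "../classes" are assumptions chosen to match the /classes/docsearch.ws result described above, and the .ws extension is appended as stated in the text.

```java
import java.net.URI;

// Hypothetical check of where the applet would look for its database.
// Not part of ruptools; it only mirrors the path rules described above.
class DatabaseLocationSketch {

    /** Resolves the codebase directory relative to the applet's HTML page. */
    static URI codebaseUri(URI appletPage, String codebase) {
        // A codebase names a directory, so make sure it ends with '/'.
        String dir = codebase.endsWith("/") ? codebase : codebase + "/";
        return appletPage.resolve(dir);
    }

    /** Codebase plus database parameter value plus the .ws extension. */
    static URI databaseUri(URI appletPage, String codebase, String database) {
        return codebaseUri(appletPage, codebase).resolve(database + ".ws");
    }

    /** True if the reduced database path is at or below the codebase. */
    static boolean isInsideCodebase(URI appletPage, String codebase, String database) {
        String basePath = codebaseUri(appletPage, codebase).getPath();
        String dbPath = databaseUri(appletPage, codebase, database).getPath();
        return dbPath.startsWith(basePath);
    }

    public static void main(String[] args) {
        URI page = URI.create("http://www.example.com/search/index.htm");
        System.out.println(databaseUri(page, "../classes", "docsearch"));
        System.out.println(isInsideCodebase(page, "../classes", "docsearch"));
        // A database value that climbs out of the codebase fails the check.
        System.out.println(isInsideCodebase(page, "../classes", "../docsearch"));
    }
}
```

The comparison is done on the reduced paths and is case sensitive, so a mismatch in case between the parameter values and the real directories will also show up here on a case-sensitive server.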
- Clicking on a title
causes the browser to issue a 'document not found' error
- When the user double-clicks on a found document title, the browser
issues a 'document not found' error message instead of opening the
document.
-
- The path parameter is probably wrong or
missing.
- The path parameter is used to correct the
database document URL with respect to the search
applet HTML file URL.
If, for example, when compiling the database the root file is
specified as:
-f /rational/application/search/doc/index.htm
and the root URL as:
-u http://www.ruptools.com/rup/rational/application/search/doc/index.htm
then the root file URL will be stored in the
database as:
rational/application/search/doc/index.htm
which is the trailing part of the path that is identical in both options:
-f rational/application/search/doc/index.htm
-u http://www.ruptools.com/rup/rational/application/search/doc/index.htm
If we now suppose the search applet HTML file to
be at:
/rational/application/search/doc/docsearch.htm
for the local file, or
http://www.ruptools.com/rup/rational/application/search/doc/docsearch.htm
for the Internet URL, then we need to correct
the document URL references in the applet database
file to move back four directories:
<param name=path value="../../../../">
Now, when the user clicks on a link, the browser will
construct the URL as follows:
/rational/application/search/doc/../../../../rational/application/search/doc/index.htm
for the local file, or
http://www.ruptools.com/rup/rational/application/search/doc/../../../../rational/application/search/doc/index.htm
for the Internet URL, which reduces to:
/rational/application/search/doc/index.htm
for the local file, or
http://www.ruptools.com/rup/rational/application/search/doc/index.htm
for the Internet URL.
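The reduction described above can be checked with java.net.URI, whose resolve method joins a relative path to a base URL and removes the ".." segments. The sketch below is hypothetical and not part of the applet: documentUri mimics how the path parameter and the stored document path are combined, and pathParamFor derives the number of "../" steps from the applet page's own path relative to the common root, which is an assumption about how the value is best chosen.

```java
import java.net.URI;

// Hypothetical illustration of how the path parameter corrects document URLs.
class DocumentUrlSketch {

    /** Applet page URL + path parameter + stored document path, reduced. */
    static URI documentUri(URI appletPage, String pathParam, String storedDocument) {
        return appletPage.resolve(pathParam + storedDocument);
    }

    /** One "../" for every directory in the applet page's relative path. */
    static String pathParamFor(String appletPageRelativePath) {
        StringBuilder path = new StringBuilder();
        for (char c : appletPageRelativePath.toCharArray()) {
            if (c == '/') {
                path.append("../");
            }
        }
        return path.toString();
    }

    public static void main(String[] args) {
        URI page = URI.create(
            "http://www.ruptools.com/rup/rational/application/search/doc/docsearch.htm");
        String stored = "rational/application/search/doc/index.htm";
        String pathParam = pathParamFor("rational/application/search/doc/docsearch.htm");
        System.out.println(pathParam);
        System.out.println(documentUri(page, pathParam, stored));
    }
}
```

Because the applet page sits four directories below the root that the stored paths are relative to, the derived value contains four "../" steps, matching the parameter value shown above.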